Automatic Generation of Parallel Treebanks: An Efficient Unsupervised System
Authors
Abstract
I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Doctor of Philosophy (Ph.D.), is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.
The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this work I introduce a novel open-source platform for the fast and robust automatic generation of parallel treebanks through sub-tree alignment, using a limited amount of external resources. The intrinsic and extrinsic evaluations that I undertook demonstrate that my system is a feasible alternative to the manual annotation of parallel treebanks. Therefore, I expect the presented platform to help boost research in the field of syntax-augmented machine translation and lead to advancements in other fields where parallel treebanks can be employed.
Acknowledgements
I would like to thank my family for their full support and patience, and especially my wife Sveta and my son Yassen, who had to put up with my living thousands of kilometres away from them. I dedicate this thesis to my grandparents Darinka and Dobrin in Bulgaria, who might be the happiest of all for the completion of my PhD degree.
[...] and Mary Hearne from the ATTEMPT project and the remaining NCLT and CNGL researchers at Dublin City University for the fruitful discussions and active collaboration.
Special thanks go to Andy Way, who recruited me for the ATTEMPT project even though I did not have any prior experience in the field of machine translation. My appreciation goes to the members of Jim Boylan's Kenpo Karate Club in Dublin for their moral support and for giving me the opportunity to relax and forget all science twice a week.
Similar resources
Unsupervised Generation of Parallel Treebanks through Sub-Tree Alignment
The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce an open-source system for fast and robust automatic generation of para...
Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
Automatic Generation of Parallel Treebanks
The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce a novel platform for fast and robust automatic generation of paral...
Treebanks in Machine Translation
We present an approach using treebanks in machine translation. Our experiment in Czech-English machine translation is an attempt to develop a full machine translation system based on dependency trees (Dependency Based Machine Translation, DBMT). We use the following resources: Prague Dependency Treebank, a newly created Czech-English parallel corpus of Penn Treebank, English monolingual corpus,...